Abstract

Image segmentation refers to the process of dividing an image into
non-overlapping, meaningful regions according to human perception, and it
has been a classic topic since the early days of computer vision. A large
body of research has been conducted and has resulted in many applications.
However, while many segmentation algorithms exist, only a few sparse and
outdated surveys are available, and an overview of recent achievements
and issues is lacking. We aim to provide a comprehensive review of the
recent progress in this field. Covering 180 publications, we give an
overview of broad areas of segmentation topics, including not only the
classic bottom-up approaches but also recent developments in superpixels,
interactive methods, object proposals, semantic image parsing and image
cosegmentation. In addition, we review the existing influential datasets
and evaluation metrics. Finally, we suggest some design flavors and
research directions for future research in image segmentation.
1. Introduction


Humans can quickly localize many patterns and automatically group them
into meaningful parts. Perceptual grouping refers to the human visual
ability to abstract high-level information from low-level image primitives
without any specific knowledge of the image content. The working
mechanisms behind this ability have been studied by cognitive scientists
since the 1920s. Early Gestalt psychologists observed that the human
visual system tends to perceive the configurational whole, with rules
governing the psychological grouping. They proposed a hierarchical
grouping from low-level features to high-level structures, which embodies
the concepts of grouping by proximity, similarity, continuation, closure
and symmetry. The highly compact representation of images produced by
perceptual grouping can greatly facilitate subsequent indexing, retrieval
and processing.

With the development of modern computers, computer scientists have
ambitiously sought to equip the computer with this perceptual grouping
ability, given its many promising applications. This effort laid the
foundation for image segmentation, which has been a classical topic since
the early years of computer vision. Image segmentation, as a basic
operation in computer vision, refers to the process of dividing a natural
image into K non-overlapping meaningful entities (e.g., objects or parts).
The segmentation operation has proved quite useful in many image
processing and computer vision tasks. For example, image segmentation has
been applied in image annotation [1] by decomposing an image into several
blobs corresponding to objects. Superpixel segmentation, which transforms
millions of pixels into hundreds or thousands of homogeneous regions [?],
has been applied to reduce the model complexity and improve the speed and
accuracy of some complex vision tasks, such as estimating dense
correspondence fields [3], scene parsing [5] and body model estimation
[?]. Other works [?] have used segmented regions to facilitate object
recognition, which provides better localization than sliding windows.
The techniques developed in image segmentation, such as Mean Shift [3]
and Normalized Cut [10], have also been widely used in other areas such
as data clustering and density estimation.

One would expect a segmentation algorithm to decompose an image into
“objects” or semantic/meaningful parts. However, what makes an “object”
or a “meaningful” part can be ambiguous. An “object” can refer to a
“thing” (a cup, a cow, etc.), a kind of texture (wood, rock) or even
“stuff” (a building or a forest). Sometimes, an “object” can also be part
of other “objects”. The lack of a clear definition of “object” makes
bottom-up segmentation a challenging and ill-posed problem. Fig. 1 gives
an example, where different human subjects interpret objects in different
ways. In this sense, what makes a ‘good’ segmentation needs to be properly
defined.

Figure 1: An image from the Berkeley segmentation dataset [?] with hand
labels from different human subjects, which demonstrates the variety of
human perception.

Research in human perception has provided some useful guidelines for
developing segmentation algorithms. For example, a cognition study [12]
shows that human vision places part boundaries at negative minima of
curvature, and that part salience depends on three factors: the relative
size, the boundary strength and the degree of protrusion. Gestalt theory
and other psychological studies have also developed various principles
reflecting human perception, which include: (1) humans tend to group
elements that are similar in color, shape or other properties; (2) humans
favor linking contours whenever the elements of the pattern establish an
implied direction.

Another challenge that makes “object” segmentation difficult is how to
effectively represent the “object”. When a human perceives an image, its
elements are perceived by the brain as a whole, but most images in
computers are currently represented by low-level features such as color,
texture, curvature, convexity, etc. Such low-level features reflect local
properties and have difficulty capturing global object information. They
are also sensitive to lighting and perspective variations, which can
cause existing algorithms to over-segment the image into trivial regions.
Fully supervised methods can learn higher-level and global cues, but they
can only handle a limited number of object classes and require per-class
labeling.

A lot of research has been conducted on image segmentation. Unsupervised
segmentation, as one classic topic in computer vision, has been studied
since the 1970s. Early techniques focus on local region merging and
splitting [?], which borrow ideas from the clustering literature. Recent
techniques, on the other hand, seek to optimize some global criteria [?].
Interactive segmentation methods [20, 21, 22], which utilize user input,
have been applied in commercial products such as Microsoft Office and
Adobe Photoshop. The substantial development of image classification [23],
object detection [?], superpixel segmentation [2] and 3D scene recovery
[25] in the past few years has boosted the research in supervised scene
parsing [?]. With the emergence of large-scale image databases, such as
ImageNet [32] and personal photo streams on Flickr, cosegmentation
methods [33, 34, 35, 36, 37, 38, 39, 40], which can extract recurring
objects from a set of images, have attracted increasing attention in
recent years.

Although image segmentation as a research area has been evolving for a
long time, the challenges ranging from feature representation to model
design and optimization are still not fully resolved, which hinders
further performance improvement towards human perception. Thus, it is
necessary to periodically give a thorough and systematic review of
segmentation algorithms, especially the recent ones, to summarize what
has been achieved, where we are now, what knowledge and lessons can be
shared and transferred between different communities, and what the
directions and opportunities for future research are. To our surprise,
there are only a few sparse reviews of the segmentation literature, and
no comprehensive review covers the broad areas of segmentation topics,
including not only the classic bottom-up approaches but also the recent
developments in superpixels, interactive methods, object proposals,
semantic image parsing and image cosegmentation. These topics are
critically and exhaustively reviewed in this paper in Sections 2-7. In
addition, we review the existing influential datasets and evaluation
metrics in Section 8. Finally, we discuss some popular design flavors and
some potential future directions in Section 9 and conclude the paper in
the final section.

2. Bottom-up methods 

Bottom-up methods usually do not take into account an explicit notion of
the object. Their goal is mainly to group nearby pixels according to some
local homogeneity in the feature space, e.g. color, texture or curvature,
by clustering those features based on fitting mixture models, mode
shifting [?] or graph partitioning [?]. In addition, variational [18]
and level set [19] techniques have also been used to segment images into
regions. Below we give a brief summary of some bottom-up methods that are
popular due to their reasonable performance and publicly available
implementations. We divide the bottom-up methods into two major
categories: discrete methods and continuous methods, where the former
consider an image as a fixed discrete grid while the latter treat an
image as a continuous surface.


2.1. Discrete bottom-up methods 

K-Means: K-means is among the simplest and most efficient methods.
Given k initial centers, which can be randomly selected, K-means first
assigns each sample to one of the centers based on their distance in
feature space, and then the centers are updated. These two steps iterate
until the termination condition is met. K-means works efficiently when
the cluster number is known and the distributions of the clusters are
spherically symmetric. However, these assumptions often do not hold in
general, and thus K-means can have problems when dealing with complex
clusters.
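
The two alternating steps can be sketched as follows. This is a minimal
pure-Python illustration rather than an implementation from any of the
surveyed works; the function name `kmeans`, the fixed random seed and the
stopping rule (assignments no longer change) are assumptions made for the
example.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Cluster a list of equal-length tuples (e.g. RGB pixels) into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers (assumed init)
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # termination: assignments stopped changing
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers
```

For segmentation, `points` would hold one feature vector per pixel, and
the returned labels would be reshaped back to the image grid.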

Mixture of Gaussians: This method is similar to K-means, except that
each cluster center is now augmented by a covariance matrix. Assume a set
of d-dimensional feature vectors $x_1, x_2, \ldots, x_n$ drawn from a
Gaussian mixture:

$$p(x \mid \{\pi_k, \mu_k, \Sigma_k\}) = \sum_k \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{1}$$

where $\pi_k$ are the mixing weights, $\mu_k, \Sigma_k$ are the means and
covariances, and

$$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{\exp\{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\}}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \tag{2}$$

is the normal density [?]. The parameters $\pi_k, \mu_k, \Sigma_k$ can be
estimated by using the expectation maximization (EM) algorithm as
follows [?]:

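1. The E-step estimates how likely a sample $x_i$ is to have been
generated from the $k$th Gaussian cluster with the current parameters:

$$z_{ik} = \frac{1}{Z_i} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \tag{3}$$

with $\sum_k z_{ik} = 1$ and $Z_i = \sum_k \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$.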
2. The M-step updates the parameters:

$$\mu_k = \frac{1}{n_k} \sum_i z_{ik} x_i, \quad \Sigma_k = \frac{1}{n_k} \sum_i z_{ik} (x_i - \mu_k)(x_i - \mu_k)^T, \quad \pi_k = \frac{n_k}{n} \tag{4}$$

where $n_k = \sum_i z_{ik}$ estimates the number of sample points in each
cluster.

After the parameters are estimated, the segmentation is formed by
assigning each pixel to its most probable cluster. Recently, Rao et al.
[?] extended Gaussian mixture models by encoding texture and boundary
using minimum description length theory and achieved better performance.
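
The E-step and M-step above can be sketched for the simplest case of
scalar features, where each covariance reduces to a variance. This is an
illustrative sketch only: the function names, the caller-supplied initial
parameters and the small variance floor (added to avoid division by zero)
are assumptions, not part of the surveyed method.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate normal density, the d = 1 case of Eq. (2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, mus, variances, weights, iters=100):
    """Fit a 1-D Gaussian mixture by EM and return hard labels per sample."""
    n, K = len(xs), len(mus)
    mus, variances, weights = list(mus), list(variances), list(weights)
    z = []
    for _ in range(iters):
        # E-step, Eq. (3): responsibilities z[i][k], normalized over k.
        z = []
        for x in xs:
            row = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            Z = sum(row)
            z.append([r / Z for r in row])
        # M-step, Eq. (4): re-estimate means, variances and mixing weights.
        for k in range(K):
            nk = sum(z[i][k] for i in range(n))
            mus[k] = sum(z[i][k] * xs[i] for i in range(n)) / nk
            var = sum(z[i][k] * (xs[i] - mus[k]) ** 2 for i in range(n)) / nk
            variances[k] = max(var, 1e-6)  # assumed floor, for stability
            weights[k] = nk / n
    # Segmentation: assign each sample to its most probable cluster.
    labels = [max(range(K), key=lambda k: z[i][k]) for i in range(n)]
    return mus, variances, weights, labels
```

With full feature vectors, the variance update becomes the outer-product
form of Eq. (4) and the density requires a matrix inverse and determinant.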

Mean Shift: Unlike parametric methods such as K-means and Mixture of
Gaussians, which make assumptions about the cluster number and feature
distributions, Mean Shift [3] is a non-parametric method that can
automatically decide the cluster number and modes in the feature space.

Assume the data points are drawn from some probability density function,
which can be estimated by convolving the data with a fixed kernel of
width h:

$$f(x) = \sum_i K(x - x_i) = \sum_i k\left(\frac{\|x - x_i\|^2}{h^2}\right) \tag{5}$$

where $x_i$ is an input sample and $k(\cdot)$ is the kernel function [?].
After the density function is estimated, mean shift uses a multiple
restart gradient descent method: starting at some initial guess $y_k$,
the gradient direction of $f(x)$ is estimated at $y_k$ and an uphill step
is taken in that direction [?]. In particular, the gradient of $f(x)$ is
given by


$$\nabla f(x) = \sum_i (x_i - x)\, G(x - x_i) = \sum_i (x_i - x)\, g\left(\frac{\|x - x_i\|^2}{h^2}\right) \tag{6}$$

where

$$g(r) = -k'(r) \tag{7}$$
and $k'(\cdot)$ is the first-order derivative of $k(\cdot)$. $\nabla f(x)$
can be rewritten as

$$\nabla f(x) = \left[\sum_i G(x - x_i)\right] m(x) \tag{8}$$

where the vector

$$m(x) = \frac{\sum_i x_i\, G(x - x_i)}{\sum_i G(x - x_i)} - x \tag{9}$$

is called the mean shift vector, which is the difference between the
weighted mean of the neighbors around $x$ and the current value of $x$.
During the mean shift procedure, the current mode estimate $y_k$ is
replaced by its locally weighted mean:

$$y_{k+1} = y_k + m(y_k). \tag{10}$$


The final segmentation is formed by grouping pixels whose convergence
points are closer than $h_s$ in the spatial domain and $h_r$ in the range
domain. These two parameters are tuned according to the requirements of
different applications.
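
The iteration of Eq. (10) can be sketched on scalar samples as follows.
The Gaussian weighting, the single bandwidth `h` (standing in for the
separate $h_s$ and $h_r$) and the rule of merging modes closer than `h`
are simplifying assumptions made for this illustration.

```python
import math


def mean_shift_mode(y, samples, h, tol=1e-6, max_iter=500):
    """Follow Eq. (10) from the start point y until the shift is negligible."""
    for _ in range(max_iter):
        # Weights G(y - x_i); a Gaussian profile is assumed here.
        w = [math.exp(-0.5 * ((y - x) / h) ** 2) for x in samples]
        # Mean shift vector of Eq. (9): weighted mean of neighbors minus y.
        m = sum(wi * x for wi, x in zip(w, samples)) / sum(w) - y
        y += m
        if abs(m) < tol:
            break
    return y


def mean_shift_cluster(samples, h):
    """Group samples whose convergence points lie within h of each other."""
    modes, labels = [], []
    for x in samples:
        mode = mean_shift_mode(x, samples, h)
        for j, existing in enumerate(modes):
            if abs(mode - existing) < h:
                labels.append(j)
                break
        else:  # no existing mode is close enough, so start a new cluster
            modes.append(mode)
            labels.append(len(modes) - 1)
    return labels, modes
```

Note that the number of clusters is not an input: it follows from the
bandwidth and the data.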

Watershed: This method [16] segments an image into catchment basins by
flooding the morphological surface from its local minima and then
constructing ridges at the places where different components meet. As
watershed associates each region with a local minimum, it can lead to
serious over-segmentation. To mitigate this problem, some methods [?]
allow the user to provide initial seed positions, which helps improve the
result.

Graph Based Region Merging: Unlike edge based region merging methods,
which use fixed merging rules, Felzenszwalb and Huttenlocher [?] advocate
a method that uses a relative dissimilarity measure to produce a
segmentation which optimizes a global grouping metric. The method maps
an image to a graph with a 4-neighbor or 8-neighbor structure. The pixels
are the nodes, while the edge weights reflect the color dissimilarity
between nodes. Initially, each node forms its own component. The internal
difference $Int(C)$ is defined as the largest weight in the minimum
spanning tree of a component $C$. The edges are then sorted by weight in
ascending order. Two regions $C_1$ and $C_2$ are merged if the weight of
the edge between them is less than
$\min(Int(C_1) + \tau(C_1), Int(C_2) + \tau(C_2))$, where
$\tau(C) = k/|C|$ and $k$ is a coefficient that is used to control the
component size. Merging stops when the difference between components
exceeds the internal difference.
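
The merging rule can be sketched with a union-find structure over a
4-neighbor grid. This is a simplified illustration on a grayscale image
given as nested lists; the function name, the absolute intensity
difference used as edge weight and the use of "less than or equal" in the
merge test are assumptions of the sketch.

```python
def segment_graph(image, k):
    """Merge pixels of a 2-D grayscale image; returns a component id per pixel."""
    h, w = len(image), len(image[0])

    def idx(r, c):
        return r * w + c

    # Build 4-neighbor edges weighted by absolute intensity difference.
    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                edges.append((abs(image[r][c] - image[r][c + 1]), idx(r, c), idx(r, c + 1)))
            if r + 1 < h:
                edges.append((abs(image[r][c] - image[r + 1][c]), idx(r, c), idx(r + 1, c)))
    edges.sort()  # process edges in ascending order of weight

    parent = list(range(h * w))
    size = [1] * (h * w)
    internal = [0.0] * (h * w)  # Int(C), stored at each component root

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for wgt, a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Merge test: edge weight against min(Int(C) + k/|C|) of both sides.
        if wgt <= min(internal[ra] + k / size[ra], internal[rb] + k / size[rb]):
            parent[rb] = ra
            size[ra] += size[rb]
            # Edges arrive sorted, so wgt is the largest MST edge so far.
            internal[ra] = wgt
    return [[find(idx(r, c)) for c in range(w)] for r in range(h)]
```

A larger `k` raises the threshold for small components and therefore
favors larger regions.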

Normalized Cut: Many methods generate the segmentation based on local
image statistics only, and thus they can produce trivial regions because
low-level features are sensitive to lighting and perspective changes. In
contrast, Normalized Cut [32] finds a segmentation by splitting the
affinity graph, which encodes global image information, i.e. by
minimizing the Ncut value between different clusters:

$$Ncut(S_1, S_2, \ldots, S_k) = \sum_{i=1}^{k} \frac{W(S_i, \bar{S}_i)}{vol(S_i)} \tag{11}$$

where $S_1, S_2, \ldots, S_k$ form a $k$-partition of a graph, $\bar{S}_i$
is the complement of $S_i$, $W(S_i, \bar{S}_i)$ is the sum of the boundary
edge weights of $S_i$, and $vol(S_i)$ is the sum of the weights of all
edges attached to vertices in $S_i$. The basic idea here is that big
clusters have large $vol(S_i)$ and minimizing Ncut encourages all
$vol(S_i)$ to be about the same, thus achieving a “balanced” clustering.

Finding the normalized cut is an NP-hard problem. Usually, an approximate
solution is sought by computing the eigenvectors $v$ of the generalized
eigenvalue system $(D - W)v = \lambda D v$, where $W = [w_{ij}]$ is the
affinity matrix of an image graph with $w_{ij}$ describing the pairwise
affinity of two pixels, and $D = [d_{ij}]$ is the diagonal matrix with
$d_{ii} = \sum_j w_{ij}$. In the seminal work of Shi and Malik [32], the
pairwise affinity is chosen as the Gaussian kernel of the spatial and
feature differences for pixels within a radius $\|x_i - x_j\| < r$:

$$w_{ij} = \exp\left(-\frac{\|F_i - F_j\|^2}{\sigma_F^2} - \frac{\|x_i - x_j\|^2}{\sigma_s^2}\right) \tag{12}$$

where $F$ is a feature vector consisting of intensity, color and Gabor
features, and $\sigma_F^2$ and $\sigma_s^2$ are the variances of the
feature and spatial position, respectively. In the later work of Malik et
al. [?], a new affinity matrix is defined using an intervening contour
method. They measure the difference between two pixels $i$ and $j$ by
inspecting the probability of an obvious edge along the line connecting
the two pixels:


$$w_{ij} = \exp\left(-\max_{p \in \overline{ij}} mPb(p) / \sigma\right) \tag{13}$$

where $\overline{ij}$ is the line segment connecting $i$ and $j$, $\sigma$
is a constant, and $mPb(p)$ is the boundary strength defined at pixel $p$
by maximizing the oriented contour signal $mPb(p, \theta)$ over multiple
orientations $\theta$:

$$mPb(p) = \max_{\theta} mPb(p, \theta) \tag{14}$$

The oriented contour signal $mPb(p, \theta)$ is defined as a linear
combination of multiple local cues at orientation $\theta$:

$$mPb(p, \theta) = \sum_s \sum_i \alpha_{i,s}\, G_{i,\sigma(i,s)}(p, \theta) \tag{15}$$

where $G_{i,\sigma(i,s)}(p, \theta)$ measures the $\chi^2$ distance at
feature channel $i$ (brightness, color a, color b, texture) between the
histograms of the two halves of a disc of radius $\sigma(i, s)$ divided
at angle $\theta$, and $\alpha_{i,s}$ is the combination weight learned by
gradient ascent on the F-measure using the training images and
corresponding ground truth. Moreover, the affinity matrix can also be
learned by using the recent multi-task low-rank representation algorithm
presented in [45].

The segmentation is achieved by recursively bi-partitioning the graph
using the eigenvector of the first nonzero eigenvalue [?], or by spectral
clustering of a set of eigenvectors [10]. For computational efficiency,
spectral clustering requires the affinity matrix to be sparse, which
limits its applications. Recent work of Cour et al. [?] addresses this
limitation by defining the affinity matrix at multiple scales and then
setting up cross-scale constraints, which achieves better results. In
addition, Arbelaez et al. [48] convolve the eigenvectors with Gaussian
directional derivatives at multiple orientations $\theta$ to obtain
oriented spectral contour responses at each pixel $p$:

$$sPb(p, \theta) = \sum_{k=1}^{n} \frac{1}{\sqrt{\lambda_k}} \cdot \nabla_\theta v_k(p) \tag{16}$$

where $\lambda_k$ and $v_k$ are the eigenvalues and eigenvectors of the
generalized eigenvalue system. Since the signals mPb and sPb carry
different contour information, Arbelaez et al. [48] proposed to combine
them into the globalized contour signal gPb:

$$gPb(p, \theta) = \beta \cdot mPb(p, \theta) + \gamma \cdot sPb(p, \theta) \tag{17}$$

where the combination weights $\beta$ and $\gamma$ are also learned by
gradient ascent on the F-measure using ground truth, which achieved
state-of-the-art contour detection results.
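
To make the criterion of Eq. (11) concrete, the sketch below evaluates
the Ncut value of a partition and finds the best bipartition of a toy
graph by exhaustive search. This is deliberately not the spectral
relaxation described above: brute force is only feasible for a handful of
nodes and is used here purely to show which cut the criterion prefers.
The function names and the dense list-of-lists affinity matrix are
assumptions of the example.

```python
from itertools import combinations


def ncut_value(W, parts):
    """Ncut of Eq. (11) for a partition given as a list of node collections."""
    n = len(W)
    total = 0.0
    for part in parts:
        S = set(part)
        # W(S, complement): weights of edges leaving S.
        cut = sum(W[i][j] for i in S for j in range(n) if j not in S)
        # vol(S): weights of all edges attached to vertices in S.
        vol = sum(W[i][j] for i in S for j in range(n))
        total += cut / vol
    return total


def best_bipartition(W):
    """Exhaustively search all bipartitions (toy graphs only)."""
    n = len(W)
    best = None
    for r in range(1, n // 2 + 1):
        for S in combinations(range(n), r):
            rest = tuple(v for v in range(n) if v not in S)
            val = ncut_value(W, [S, rest])
            if best is None or val < best[0]:
                best = (val, S, rest)
    return best
```

On two tightly connected triangles joined by one weak edge, the minimum
is attained by cutting the weak edge, whereas isolating a single node is
penalized by its small volume.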

Edge Based Region Merging: Such methods [14, 49] start from pixels or
superpixels, and then merge two adjacent regions based on metrics that
reflect their similarity or dissimilarity, such as boundary length, edge
strength or color difference. Recently, Arbelaez et al. [50] proposed the
gPb-OWT-UCM method to transform a set of contours, which are generated
from the Normalized Cut framework described above, into a nested
partition of the image. The method first generates a probability edge map
gPb (see Eq. 17), which delineates the salient contours. Then, it
performs watershed over the topological space defined by gPb to form the
finest-level segmentation. Finally, the edges between regions are sorted
and merged in ascending order, which forms the ultrametric contour map
(UCM). Thresholding the UCM at a scale $\lambda$ forms the final
segmentation.


2.2. Continuous methods 

Variational techniques [18, ?, ?, 52, 53] have also been used to segment
images into regions. They treat an image as a continuous surface instead
of a fixed discrete grid and can produce visually more pleasing results.

Mumford-Shah Model: The Mumford-Shah (MS) model partitions an image by
minimizing a functional that encourages homogeneity within each region as
well as sharp, piecewise regular boundaries. The MS functional is defined
as

$$F_{MS}(I, C) = \int_{\Omega} |I - I_0|^2 \, dx + \mu \int_{\Omega \setminus C} |\nabla I|^2 \, dx + \nu \mathcal{H}^{N-1}(C) \tag{18}$$

for any observed image $I_0$ and any positive parameters $\mu, \nu$, where
$I$ corresponds to a piecewise smooth approximation of $I_0$, $C$
represents the boundary contours of $I$, and its length is given by the
Hausdorff measure $\mathcal{H}^{N-1}(C)$. The first term of (18) is a
fidelity term with respect to the given data $I_0$, the second term
regularizes the function $I$ to be smooth inside the region
$\Omega \setminus C$, and the last term imposes a regularization
constraint on the discontinuity set $C$ to be smooth.

Since minimizing the Mumford-Shah model is not easy, many variants have
been proposed to approximate the functional [?, 55, ?, 57]. Among them,
Vese and Chan [58] proposed to approximate the term
$\mathcal{H}^{N-1}(C)$ by the lengths of region contours, which gives the
model of active contours without edges. By assuming the image is
piecewise constant, the model is further simplified to the continuous
Potts model, which has convexified solvers.

Active Contour / Snake Model [59]: This type of model detects objects by
deforming a snake/contour curve $C$ towards sharp image edges. The
evolution of the parametric curve $C(p) = (x(p), y(p))$, $p \in [0, 1]$,
is driven by minimizing the functional

$$F(C) = \alpha \int_0^1 \left|\frac{\partial C}{\partial p}\right|^2 dp + \beta \int_0^1 \left|\frac{\partial^2 C}{\partial p^2}\right|^2 dp + \lambda \int_0^1 f^2(I_0(C)) \, dp \tag{19}$$

where the first two terms enforce smoothness constraints by making the
snake act as a membrane and a thin plate, respectively, and their sum
forms the internal energy. The third term, called the external energy,
attracts the curve toward object boundaries by using the edge detecting
function

$$f = \frac{1}{1 + \gamma |\nabla (I_0 * G_\sigma)|^2} \tag{20}$$

where $\gamma$ is an arbitrary positive constant and $I_0 * G_\sigma$ is
the Gaussian-smoothed version of $I_0$. The energy function is non-convex
and sensitive to initialization. To overcome this limitation, Osher et
al. [19] proposed the level set method, which implicitly represents the
curve $C$ by a higher-dimensional function $\phi$, called the level set
function. Moreover, Bresson et al. [51] proposed the convex relaxed
active contour model, which can attain a global optimum.
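
The edge detecting function of Eq. (20) can be sketched on a small
grayscale image stored as nested lists. The separable Gaussian truncated
at three standard deviations, the replicated borders and the central
differences are assumptions of this illustration; any reasonable
smoothing and gradient scheme would serve the same purpose.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]


def smooth(img, sigma):
    """Separable Gaussian smoothing with replicated borders."""
    ker = gaussian_kernel(sigma)
    r = len(ker) // 2
    h, w = len(img), len(img[0])

    def clamp(v, hi):
        return min(max(v, 0), hi - 1)

    rows = [[sum(ker[t + r] * img[y][clamp(x + t, w)] for t in range(-r, r + 1))
             for x in range(w)] for y in range(h)]
    return [[sum(ker[t + r] * rows[clamp(y + t, h)][x] for t in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def edge_stopping(img, sigma=1.0, gamma=1.0):
    """f = 1 / (1 + gamma * |grad(I0 * G_sigma)|^2), as in Eq. (20)."""
    s = smooth(img, sigma)
    h, w = len(s), len(s[0])
    f = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences, one-sided effect at the borders via clamping.
            gx = (s[y][min(x + 1, w - 1)] - s[y][max(x - 1, 0)]) / 2.0
            gy = (s[min(y + 1, h - 1)][x] - s[max(y - 1, 0)][x]) / 2.0
            f[y][x] = 1.0 / (1.0 + gamma * (gx * gx + gy * gy))
    return f
```

The function is close to 1 in flat regions and drops towards 0 on strong
edges, so minimizing the external energy pulls the curve onto edges.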

The bottom-up methods can also be classified into another two categories:
those [15, 10, ?] that attempt to produce regions likely to belong to
objects, and those that tend to produce over-segmentation [16, 3] (to be
introduced in Section 3). For methods of the first category, obtaining
object regions is extremely challenging, as bottom-up methods only use
low-level features. Recently, Zhu et al. [60] proposed to combine
hand-crafted low-level features that can reflect global image statistics
(such as Gaussian Mixture Models, geodesic distance and eigenvectors)
with the convexified continuous Potts model to capture high-level
structures, which achieves some promising results.

3. Superpixel 

Superpixel methods aim to over-segment an image into homogeneous regions
that are smaller than objects or parts. In their seminal work, Ren and
Malik [?] argue that superpixels are a more natural and efficient
representation than pixels, because local cues extracted at the pixel
level are ambiguous and sensitive to noise. Superpixels have a few
desirable properties:

• They are perceptually more meaningful than pixels. The superpixels
produced by state-of-the-art algorithms are nearly perceptually
consistent in terms of color, texture, etc., and most structures are well
preserved. Besides, superpixels also convey some shape cues that are
difficult to capture at the pixel level.

• They help reduce model complexity and improve efficiency and
accuracy. Existing pixel-based methods need to deal with millions of
pixels and their parameters, and training and inference in such a big
system pose great challenges to current solvers. In contrast, using
superpixels to represent the image can greatly reduce the number of
parameters and the computation cost. Meanwhile, by exploiting the large
spatial support of superpixels, more discriminative features such as
color or texture histograms can be extracted. Last but not least,
superpixels make longer-range information propagation possible, which
allows existing solvers to exploit richer information than when using
pixels.

There are different paradigms to produce superpixels: 

• Some existing bottom-up methods can be directly adapted to the
over-segmentation scenario by tuning their parameters, e.g. Watershed,
Normalized Cut (by increasing the cluster number), Graph Based Merging
(by controlling the region size) and Mean-Shift/Quick-Shift (by tuning
the kernel size or changing the mode drifting style).

• Some recent methods produce superpixels much faster by changing the
optimization scope from the whole image to local non-overlapping initial
regions, and then adjusting the region boundaries to snap to salient
object contours. TurboPixel [62] deforms the initial spatial grid into
compact and regular regions by using a geometric flow directed by local
gradients. Wang et al. [63] also adapted geodesic flows, computing
geodesic distances among pixels to produce adaptive superpixels, which
have higher density in regions of high intensity or color variation and
larger superpixels in structure-less regions. Veksler et al. [?] proposed
to place overlapping patches over the image and then assign each pixel to
a patch by inferring the MAP solution using graph cuts. Zhang et al. [66]
further pursued this direction by using pseudo-boolean optimization,
which achieves faster speed. Achanta et al. [2] introduced the SLIC
algorithm, which greatly improves superpixel efficiency. SLIC starts from
an initial regular grid of superpixels, grows superpixels by estimating
each pixel’s distance to nearby cluster centers, and then updates the
cluster centers, which is essentially a localized K-means. SLIC can
produce superpixels at 5Hz without GPU optimization.
Figure 2: Examples of interactive image segmentation using a bounding box
(a ~ b) [20] or scribbles (c ~ d) [21].


• There are also some new formulations for over-segmentation. Liu et
al. [66] recently proposed a new graph based method, which maximizes the
entropy rate of the cuts in the graph with a balance term for compact
representation. Although it outperforms many methods in terms of
boundary recall, it takes about 2.5s to segment an image of size 480x320.
Van den Bergh et al. [67] proposed the fastest superpixel method, SEEDS,
which can run at 30Hz. SEEDS uses multi-resolution image grids as initial
regions. For each image grid, they define a color histogram based entropy
term and an optional boundary term. Instead of using EM as in SLIC, which
needs to repeatedly compute distances, SEEDS uses hill climbing to move
coarser-resolution grids, and then refines the region boundaries using
finer-resolution grids. In this way, SEEDS achieves real-time superpixel
segmentation at 30Hz.


4. Interactive methods 

Image segmentation is expected to produce regions matching human
perception. Without any prior assumption, it is difficult for bottom-up
methods to produce object regions. For some specific areas such as image
editing and medical image analysis, which require precise object
localization and segmentation for subsequent applications (e.g. changing
the background or organ reconstruction), prior knowledge or constraints
(e.g. color distribution, contour enclosure or texture distribution)
directly obtained from a small amount of user input can be of great help
in producing accurate object segmentation. Segmentation methods of this
type, which rely on user input, are termed interactive image segmentation
methods. There are already some surveys [68, 69, 70] on interactive
segmentation techniques, and thus this paper acts as a supplement to
them, discussing recent influential literature not covered by the
previous surveys. In addition, to make the manuscript self-contained, we
also give a brief review of some classical techniques.

In general, an interactive segmentation method has the following
pipeline: 1) the user provides initial input; 2) the segmentation
algorithm produces segmentation results; 3) based on the results, the
user provides further constraints, and the process returns to step 2.
The process repeats until the results are satisfactory. A good
interactive segmentation method should meet the following criteria: 1)
offering an intuitive interface for the user to express various
constraints; 2) producing fast and accurate results with as little user
manipulation as possible; 3) allowing the user to make additional
adjustments to further refine the result.

Existing interactive methods can be classified according to their user
interface, which is the bridge through which the user conveys prior
knowledge (see Fig. 2). Popular interfaces include bounding box [20],
polygon (the object of interest is within the provided region), contour
[71, 72] (the object boundary should follow the provided direction), and
scribbles [73, 21] (the object should follow similar color
distributions). According to the methodology, the existing interactive
methods can be roughly classified into two groups: (i) contour based
methods and (ii) label propagation based methods.

4.1. Contour based interactive methods

Contour based methods are among the earliest interactive segmentation
methods. In a typical contour based interactive method, the user first
places a contour close to the object boundary, and then the contour
evolves to snap to the nearby salient object boundary. One of the key
components of contour based methods is the edge detection function, which
can be based on first-order image statistics (such as the Sobel, Canny
and Prewitt operators) or more robust second-order statistics (such as
the Laplacian and LoG operators).

For instance, the Live-Wire / Intelligent Scissors method [?] starts
contour evolution by building a weighted graph on the image, where each
node in the graph corresponds to a pixel, and directed edges connect each
pixel to its four or eight nearest neighbors. The local cost of each
directed edge is the weighted sum of the Laplacian zero-crossing,
gradient magnitude and gradient direction. Then, given the seed
locations, the shortest path from a pixel p to a seed point s is found by
using Dijkstra’s method. Essentially, Live-Wire minimizes a local energy
function. On the other hand, the active contour method introduced in
Section 2.2 deforms the contour by using a global energy function, which
consists of a regional term and a boundary term, to overcome some
ambiguous local optima. Nguyen et al. [22] recently employed the convex
active contour model to refine the object boundary produced by other
interactive segmentation methods and achieved satisfactory segmentation
results. Liu et al. [?] proposed to use the level set function [19] to
track the zero level set of the posterior probabilistic mask learned from
the user-provided bounding box, which can capture objects with complex
topology and fragmented appearance such as tree leaves.
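
The shortest-path step of Live-Wire can be sketched with Dijkstra's
algorithm on an 8-connected pixel grid. In this illustration the local
cost is simplified to a single precomputed value per pixel (the cost of
stepping onto it), which stands in for the weighted sum of Laplacian
zero-crossing, gradient magnitude and gradient direction; the function
name and this cost layout are assumptions.

```python
import heapq


def live_wire_path(cost, seed, target):
    """Minimum-cost 8-connected path from seed to target on a cost grid."""
    h, w = len(cost), len(cost[0])
    dist = {seed: 0.0}
    prev = {}
    heap = [(0.0, seed)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == target:
            break
        if d > dist[(r, c)]:
            continue  # stale queue entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < h and 0 <= nc < w:
                    nd = d + cost[nr][nc]  # pay the cost of the entered pixel
                    if nd < dist.get((nr, nc), float("inf")):
                        dist[(nr, nc)] = nd
                        prev[(nr, nc)] = (r, c)
                        heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the target to the seed to recover the contour piece.
    path = [target]
    while path[-1] != seed:
        path.append(prev[path[-1]])
    return path[::-1]
```

If pixels on strong edges are given low cost, the recovered path hugs the
object boundary between two user clicks.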


4.2. Label propagation based methods

Label propagation methods are more popular in the literature. The basic
idea of label propagation is to start from user-provided initial input
marks, and then propagate the labels using either global optimization
(such as GraphCut [?] or RandomWalk [?]) or local optimization. The
global optimization methods are more widely used due to the existence of
fast solvers.

GraphCut and its decedents [731 1201 [77J model the pixel labeling 
problem in Markov Random Field. Their energy functions can be generally 
expressed as 

E(X) = y; E u (xj) + Y E p (xi,Xj ) (21) 

Xi&X Xi,Xj£j\f 


where X = xi,X 2 , ■■■,x n is the set of random variables defined at each pixel, 
which can take either foreground label 1 or background label 0; A f defines 
a neighborhood system, which is typically 4-neighbor or 8-neighbor. The 
first term in (21) is the unary potential, and the second term is the pairwise 
potential. In [73], the unary potential is evaluated by using an intensity 
histogram. Later in GrabCut ra, the unary potential is derived from the two 
Gaussian Mixture Models (GMMs) for background and foreground regions 
respectively, and then a hard constraint is imposed on the regions outside the 
bounding box such that the labels in the background region remain constant 
while the regions within the box are updated to capture the object. Li et 
al. E3 further proposed the LazySnap method, which extends GrabCut by
using superpixels and including an interface that allows the user to adjust the
result at low-contrast and weak object boundaries. One limitation of GrabCut
is that it favors short boundaries due to the energy function design. Recently,
Kohli et al. [75] proposed to use a conditional random field with multiple-layered
hidden units to encode a boundary-preserving higher order potential,
which has efficient solvers and therefore can capture thin details that are
often neglected by classical MRF models.
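
To make the energy in (21) concrete, the sketch below evaluates it for a binary labeling on a 4-neighbor grid. The unary costs and the Potts-style pairwise penalty `lam` are hypothetical choices for illustration; the methods cited above derive the unary term from histograms or GMMs and typically use contrast-sensitive pairwise terms.

```python
import numpy as np

def mrf_energy(labels, unary, lam=1.0):
    """Energy E(X) = sum_i E_u(x_i) + sum_{(i,j) in N} E_p(x_i, x_j).

    labels: (H, W) int array of 0 (background) / 1 (foreground).
    unary:  (H, W, 2) array, unary[r, c, k] = cost of giving pixel (r, c) label k.
    lam:    assumed Potts penalty paid by each 4-neighbor pair with different labels.
    """
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    unary_term = unary[rows, cols, labels].sum()
    # Count label changes between vertical and horizontal neighbors.
    cuts = (labels[1:, :] != labels[:-1, :]).sum() + (labels[:, 1:] != labels[:, :-1]).sum()
    return float(unary_term + lam * cuts)

unary = np.zeros((2, 2, 2))
unary[..., 1] = [[0.0, 2.0], [0.0, 2.0]]   # cost of labeling each pixel as foreground
unary[..., 0] = [[2.0, 0.0], [2.0, 0.0]]   # cost of labeling each pixel as background
labels = np.array([[1, 0], [1, 0]])
energy = mrf_energy(labels, unary, lam=0.5)
all_foreground = mrf_energy(np.ones((2, 2), dtype=int), unary, lam=0.5)
```

The labeling that follows the unary costs pays only for the two cut edges, while the all-foreground labeling pays no pairwise cost but a larger unary cost. A GraphCut solver searches for the labeling that minimizes this trade-off.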

Since the MRF/CRF model provides a unified framework to combine multiple
sources of information, various methods have been proposed to incorporate
prior knowledge:

• Geodesic prior: One typical assumption on objects is that they are
compactly clustered in space, instead of being scattered around.
Such a spatial constraint can be constructed by exploiting the geodesic
distance to foreground and background seeds. Unlike the Euclidean
distance, which directly measures the two-point distance in space,
the geodesic distance measures the lowest-cost path between them.
The weight is set to the gradient of the likelihood of pixels belonging
to the foreground. Then the geodesic distance can be incorporated as
a kind of data term in the energy function as in O-. Later, Bai et
al. [80] further extended the solution to the soft matting problem. Zhu
et al. [60] also incorporated Bai's geodesic distance as one type
of object potential to produce bottom-up object segmentation, which
achieves improved results.

• Convex prior: Another common assumption is that most objects are
convex. Such a prior can be expressed as a kind of labeling constraint.
For example, a star shape is defined with respect to a center point
c. An object has a star shape if for any point p inside the object, all
points on the straight line between the center c and p also lie inside
the object. Veksler et al. [81] formulated such a constraint by penalizing
different labelings on the same line, but this formulation only
works with a single star center. Gulshan et al. [82] proposed to use the
geodesic distance transform [83] to compute the geodesic convexity from
each pixel to the star center, which works on objects with multiple star
centers. Other similar connectivity constraints have also been studied
in [83].
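
The star-shape definition above can be checked directly on a binary mask. The following is a minimal sketch of that definition only, not of the graph-cut formulation in [81]; the function name `is_star_shaped` and the sampling parameter `samples` are assumptions for illustration.

```python
import numpy as np

def is_star_shaped(mask, center, samples=64):
    """Check the star-shape property of a binary mask w.r.t. a center pixel.

    For every foreground pixel p, points sampled on the straight segment from
    center to p (rounded to the pixel grid) must also be foreground.
    `samples` is an assumed discretization parameter.
    """
    cr, cc = center
    if not mask[cr, cc]:
        return False
    t = np.linspace(0.0, 1.0, samples)
    for pr, pc in zip(*np.nonzero(mask)):
        rr = np.rint(cr + t * (pr - cr)).astype(int)
        cs = np.rint(cc + t * (pc - cc)).astype(int)
        if not mask[rr, cs].all():
            return False
    return True

# A filled square is star-shaped about its center.
square = np.zeros((7, 7), dtype=bool)
square[2:5, 2:5] = True

# A "C" shape is not star-shaped about its top-left corner: the segment
# from (1, 1) to (3, 3) crosses the empty pixel (2, 2).
c_shape = np.zeros((5, 5), dtype=bool)
c_shape[1, 1:4] = True
c_shape[3, 1:4] = True
c_shape[1:4, 1] = True
```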

RandomWalk: The Random Walk model ESI provides another pixel
labeling framework, which has been applied to many computer vision problems,
such as segmentation, denoising and image matching. With notations
similar to those for GraphCut in (21), Random Walk starts from building an
undirected graph $G = (V, E)$, where $V$ is the set of vertices defined at each
pixel and $E$ is the set of edges which incorporate the pairwise weight $w_{ij}$ to
reflect the probability of a random walker jumping between two nodes $i$ and
$j$. The degree of a vertex is defined as $d_i = \sum_j w_{ij}$.

Given the weighted graph, a set of user-scribbled nodes $V_m$ and a set
of unmarked nodes $V_u$, such that $V_m \cup V_u = V$ and $V_m \cap V_u = \emptyset$, the
Random Walk approach is to assign each node $i \in V_u$ a probability $x_i$ that
a random walker starting from that node first reaches a marked node. The
final segmentation is formed by thresholding $x_i$.

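The entire set of node probabilities $x$ can be obtained by minimizing

$$E(x) = x^T L x \qquad (22)$$

where $L$ represents the combinatorial Laplacian matrix defined as

$$L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -w_{ij} & \text{if } i \neq j \text{ and } i \text{ and } j \text{ are adjacent nodes} \\ 0 & \text{otherwise.} \end{cases} \qquad (23)$$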


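By partitioning the matrix $L$ into marked and unmarked blocks as

$$L = \begin{bmatrix} L_m & B \\ B^T & L_u \end{bmatrix} \qquad (24)$$

and defining a $|V_m| \times 1$ indicator vector $f$ as

$$f_j = \begin{cases} 1 & \text{if } j \text{ is labeled as foreground} \\ 0 & \text{otherwise,} \end{cases} \qquad (25)$$

the minimization of (22) with respect to $x_u$ results in

$$L_u x_u = -B^T f \qquad (26)$$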


where only a sparse linear system needs to be solved. Yang et al. EJ further 
proposed a constrained Random Walk that is able to incorporate different 
types of user inputs as additional constraints such as hard and soft constraints 
to provide more flexibility for users. 
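
The linear system in (26) can be illustrated on a tiny graph. The sketch below builds the combinatorial Laplacian of (23), extracts the blocks of (24) and solves for the unmarked nodes. The function name `random_walk_probabilities` and the chain graph are assumptions for illustration, and a dense solver is used for brevity where a real implementation would use a sparse one.

```python
import numpy as np

def random_walk_probabilities(W, marked, f):
    """Solve L_u x_u = -B^T f for the unmarked nodes.

    W:      (n, n) symmetric weight matrix, W[i, j] = w_ij (0 if not adjacent).
    marked: list of indices of user-marked nodes (V_m).
    f:      indicator over marked nodes, 1 for foreground and 0 otherwise.
    Returns (unmarked indices, probabilities x_u).
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W          # combinatorial Laplacian, as in (23)
    marked_set = set(marked)
    unmarked = [i for i in range(n) if i not in marked_set]
    L_u = L[np.ix_(unmarked, unmarked)]
    B = L[np.ix_(marked, unmarked)]         # upper-right block of the partition (24)
    x_u = np.linalg.solve(L_u, -B.T @ np.asarray(f, dtype=float))
    return unmarked, x_u

# Chain graph 0 - 1 - 2 - 3 with unit weights; node 0 is a foreground seed
# and node 3 is a background seed.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
unmarked, x_u = random_walk_probabilities(W, marked=[0, 3], f=[1, 0])
segmentation = x_u > 0.5   # threshold the probabilities
```

On this chain the unmarked node next to the foreground seed receives probability 2/3 and the one next to the background seed receives 1/3, so thresholding at 0.5 splits the chain in the middle.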

Local optimization based methods: Many label propagation methods
use specific solvers such as graph cut and belief propagation to get the
most likely solution. However, such global optimization solvers are
often slow and do not scale well with image size. Hosni et al. [85] showed that
by filtering the cost volume using fast edge-preserving filters (such as the Joint
Bilateral Filter [86], Guided Filter [87], Cross-map filter [88], etc.) and then
using Winner-Takes-All label selection to take the most likely labels, they
can achieve comparable or better results than globally optimized models. More
importantly, such cost-volume filtering approaches can achieve real-time
performance, e.g. 2.85 ms to filter a 1 Mpix image on a Core 2 Quad 2.4 GHz
desktop. Recently, Criminisi et al. [83] proposed to use the geodesic distance
transform to filter the cost volume, which can produce results comparable to the
global optimization methods and can better capture the edges at weak
boundaries.
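
The filter-then-select idea can be sketched in a few lines. The example below is only a schematic: it uses a plain box filter as a stand-in for the edge-preserving filters named above (an assumption made to keep the code self-contained), and the function names are hypothetical.

```python
import numpy as np

def box_filter(img, radius):
    """Mean filter with edge padding; a simple stand-in for an edge-preserving filter."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dr in range(k):
        for dc in range(k):
            out += padded[dr:dr + img.shape[0], dc:dc + img.shape[1]]
    return out / (k * k)

def filter_and_select(cost_volume, radius=1):
    """Filter each label's cost slice, then pick the lowest-cost label per pixel."""
    filtered = np.stack([box_filter(c, radius) for c in cost_volume])
    return filtered.argmin(axis=0)   # winner-takes-all

# Two labels on a 5x5 image: label 0 is cheap everywhere except one noisy pixel.
cost = np.zeros((2, 5, 5))
cost[1] = 1.0
cost[0, 2, 2] = 5.0                  # isolated outlier favouring label 1
raw = cost.argmin(axis=0)
smoothed = filter_and_select(cost, radius=1)
```

Without filtering, the outlier pixel is assigned label 1; after filtering the cost slices, the winner-takes-all selection assigns label 0 everywhere. An edge-preserving filter would additionally avoid smoothing across image edges.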

There are also some simple region growing based methods [89] ED], which
start from user-drawn seeded regions, and then iteratively merge the remaining
groups according to similarities. One limitation of such methods is that
different merging orders can produce different results.

5. Object Proposals 

Automatically and precisely segmenting out objects from an image is still
an unsolved problem. Instead of searching for a deterministic object segmentation,
recent research on object proposals relaxes the object segmentation
problem by looking for a pool of regions such that, with high probability,
some of the proposals cover the objects. This type of method leverages the
high-level concept of “object” or “thing” to separate object regions from “stuff”,
where an “object” tends to have clear size and shape (e.g. pedestrian, car),
as opposed to “stuff” (e.g. sky or grass) which tends to be homogeneous or
with recurring patterns of fine-scale structure.

Class-specific object proposals: One approach to incorporate the object
notion is through the use of class-specific object detectors. Such object
detectors can be any bounding box detector, e.g. the famous Deformable Part
Model (DPM) [23] or Poselets |9T]. A few works have been proposed to combine
object detection with segmentation. For example, Larlus and Jurie [92]
obtained the object segmentation by refining the bounding box using a CRF.
Gu et al. [T] proposed to use hierarchical regions for object detection, instead
of bounding boxes. However, class-specific object segmentation methods can
only be applied to a limited number of object classes, and cannot handle a
large number of object classes, e.g. ImageNet [32], which has thousands of
classes.

Figure 3: An example of class-independent object segmentation [8].

Class-independent object proposals: Inspired by the objectness window
work of Alexe et al. [23], methods of class-independent region proposals
directly attempt to produce general object regions. The underlying rationale
for class-independent proposals to work is that the object-of-interest is
typically distinct from the background in certain appearance or geometry cues.
Hence, in some sense the general object proposal problem is correlated with the
popular salient object detection problem. A more complete survey of
state-of-the-art objectness window methods and salient object detection methods
can be found in [93] and [95], respectively. Here we focus on the region
proposal methods.

One group of class-independent object proposal methods first uses
bottom-up methods to generate region proposals, and then applies a pre-trained
classifier to rank the proposals according to the region features. The
representative works include CPMC [8] and Category Independent Object
Proposal [9], which extend GrabCut to this scenario by using different seed
sampling techniques (see Fig. 3 for an example). In particular, CPMC applies
sequential GrabCuts by using a regular grid as foreground seeds and the image
frame boundary as background seeds to train the Gaussian mixture models.
Category Independent Object Proposal samples foreground seeds from
occlusion boundaries inferred by [96]. Then the generated region proposals
are fed to a random forest classifier or regressor, which is trained with the
features (such as convexity, color contrast, etc.) of the ground truth object
regions, for re-ranking. One bottleneck of such appearance based object proposal
methods is that re-computing GrabCut’s appearance model implicitly
incorporates the re-computation of some distance, which makes such methods
hard to speed up. A recent work in ra pre-computes a graph which can
be used for parametric min-cuts over different seeds, and then dynamic graph
cut PS] is applied to accelerate the computation. Instead of computing the
expensive GrabCut, another recent work called Geodesic Object Proposal (GOP) [99]
exploits the fast geodesic distance transform for object proposals. GOP uses
superpixels as its atomic units, chooses the first seed location as the one whose
geodesic distance to all other superpixels is the smallest, and then places the
next seeds far from the existing seeds. The authors also proposed to use RankSVM
to learn to place seeds, but the performance improvement is not significant.
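
The greedy seed placement described for GOP can be sketched as follows. This is one possible reading, not the implementation of [99]: it assumes a precomputed matrix `D` of geodesic distances between superpixels, interprets "smallest to all other superpixels" as the smallest total distance, and uses the hypothetical function name `place_seeds`.

```python
import numpy as np

def place_seeds(D, num_seeds):
    """Greedy seed placement on a precomputed geodesic distance matrix D.

    The first seed minimizes the total distance to all other superpixels
    (an assumed interpretation); each following seed maximizes its distance
    to the closest existing seed.
    """
    seeds = [int(D.sum(axis=1).argmin())]
    while len(seeds) < num_seeds:
        to_nearest_seed = D[:, seeds].min(axis=1)
        seeds.append(int(to_nearest_seed.argmax()))
    return seeds

# Five superpixels on a line with unit spacing (toy distances).
pos = np.arange(5, dtype=float)
D = np.abs(pos[:, None] - pos[None, :])
seeds = place_seeds(D, 3)
```

On this toy line the first seed lands in the middle and the following seeds land on the two ends, which spreads the seeds over the image as intended.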

Another bottom-up way to generate region proposals is to use the edge
based region merging method, gPb-OWT-UCM, described in Section 2. Then,
the region hierarchies are filtered using the classifiers of CPMC. Due to its
reliance on a high-accuracy contour map, such a method can achieve higher
accuracy than GrabCut based methods. However, such a method is limited by
the performance of contour detectors, whose shortcomings in speed and accuracy
have been greatly reduced by some recently introduced learning based
methods such as Structured Forest HDD! or the multi-resolution eigen-solvers.
The latter solver has been applied in an improved version of gPb-OWT-UCM
(MCG) [101] which simultaneously considers the region combinatorial space.

Different from the above class-independent object proposal methods, which
use a single strategy to generate object regions, another group of methods
applies multiple hand-crafted strategies to produce diversified solutions during
the process of atomic unit generation and region proposal generation, which
we call diversified region proposal methods. This type of method typically
just produces diversified region proposals, but does not train a classifier to rank
the proposals. For example, SelectiveSearch [102] generates region trees from
superpixels to capture objects at multiple scales by using different merging
similarity metrics (such as RGB, Intensity, Texture, etc.). To increase the
degree of diversification, SelectiveSearch also applies different parameters to
generate initial atomic superpixels. After different segmentation trees are
generated, the detection starts from the regions in higher hierarchies. Manen
et al. [103] proposed a similar method which exploits merging randomized
trees with learned similarity metrics.

Figure 4: An example of semantic segmentation [30], which is trained using pictures with
human labeled ground truth such as (b) to segment the test image in (c) and produce the
final segmentation in (d).


6. Semantic Image Parsing 

Semantic image parsing aims to break an image into non-overlapping regions
which correspond to predefined semantic classes (e.g. car, grass, sheep,
etc.), as shown in Fig. 4. The popularity of semantic parsing since the early
2000s is deeply rooted in the success of some specific computer vision tasks
such as face/object detection and tracking, camera pose estimation, multiple
view 3D reconstruction and fast conditional random field solvers. The
ultimate goal of semantic parsing is to equip the computer with the holistic
ability to understand the visual world around us. Although also depending
on the given information, high-level learned representations make it different
from the interactive methods. The learned models can be used to predict
similar regions in new images. This type of approach is also different from
object region proposals in the sense that it aims to parse an image as a
whole into the “thing” and “stuff” classes, instead of just producing possible
“thing” candidates.

Most state-of-the-art image parsing systems are formulated as the problem
of finding the most probable labeling on a Markov random field (MRF)
or conditional random field (CRF). A CRF provides a principled probabilistic
framework to model complex interactions between output variables and
observed features. Thanks to the ability to factorize the probability distribution
over different labelings of the random variables, a CRF allows for compact
representations and efficient inference. The CRF model defines a Gibbs
distribution of the output labeling $y$ conditioned on observed features $x$ via an
energy function $E(y; x)$:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\{-E(y; x)\}, \qquad (27)$$

where $y = (y_1, y_2, \ldots, y_n)$ is a vector of random variables $y_i$ ($n$ is the number of
pixels or regions) defined on each node $i$ (pixel or superpixel), which takes a
label from a predefined label set $L$ given the observed features $x$. $Z(x)$ here
is called the partition function, which ensures the distribution is properly
normalized and sums to one. Computing the partition function is intractable
because it sums exponential terms over all possible labelings. On the other hand,
such computation is not necessary given that the task is to infer the most likely
labeling. Maximizing the posterior in (27) is equivalent to minimizing the energy
function $E(y; x)$. A common model for pixel labeling involves a unary potential
$\psi_u(y_i; x)$ which is associated with each pixel, and a pairwise potential
$\psi_p(y_i, y_j)$ which is associated with a pair of neighboring pixels:

$$E(y; x) = \sum_{i=1}^{n} \psi_u(y_i; x) + \sum_{(y_i, y_j) \in \mathcal{N}} \psi_p(y_i, y_j) \qquad (28)$$
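
The equivalence between maximizing the posterior in (27) and minimizing the energy in (28) can be verified by brute force on a tiny model. The sketch below enumerates all labelings of a three-node chain; the unary costs, the Potts pairwise weight and the function name `energy` are hypothetical values chosen for illustration. Enumeration is only feasible here because the model is tiny, which is exactly why the partition function is avoided in practice.

```python
import itertools
import math

def energy(y, unary, pairwise_weight, edges):
    """E(y; x) = sum_i psi_u(y_i; x) + sum_{(i,j)} psi_p(y_i, y_j), with a Potts pairwise term."""
    e = sum(unary[i][yi] for i, yi in enumerate(y))
    e += sum(pairwise_weight for i, j in edges if y[i] != y[j])
    return e

# Three nodes in a chain, two labels; unary[i][k] is a hypothetical cost
# of assigning label k to node i.
unary = [[0.2, 1.0], [0.6, 0.5], [1.0, 0.1]]
edges = [(0, 1), (1, 2)]
labelings = list(itertools.product([0, 1], repeat=3))

energies = {y: energy(y, unary, 0.3, edges) for y in labelings}
Z = sum(math.exp(-e) for e in energies.values())          # partition function
posterior = {y: math.exp(-e) / Z for y, e in energies.items()}

map_labeling = max(posterior, key=posterior.get)
min_energy_labeling = min(energies, key=energies.get)
```

Both searches return the same labeling, and the minimum-energy search never needs `Z`.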


Given the energy function, semantic image parsing usually follows this
pipeline: 1) Extract features from a patch centered on each pixel;
2) With the extracted features and the ground truth labels, train an appearance
model to produce a compatibility score for each training sample; 3)
Apply the trained classifier to the test image’s pixel-wise features, and
use the output as the unary term; 4) Define the pairwise term of the CRF
over a 4- or 8-connected neighborhood for each pixel; 5) Perform
maximum a posteriori (MAP) inference on the graph. Following this common
pipeline, there are different variants in different aspects:


• Features: The commonly used features are bottom-up pixel-level features
such as color or texton. He et al. [104] proposed to incorporate
region and image level features. Shotton et al. [105] proposed to use
spatial layout filters to represent the local information corresponding
to different classes, which was later adapted to the random forest framework
for real-time parsing [23]. Recently, features learned by deep convolutional
neural networks [1 UB] have also been applied to replace the
hand-crafted features, which achieves promising performance.

• Spatial support: The spatial support in step 1 can be adapted to superpixels,
which conform to image internal structures and make feature
extraction less susceptible to noise. Also, by exploiting superpixels, the
complexity of the model is greatly reduced from millions of variables to
only hundreds or thousands. Hoiem et al. sna used multiple segmentations
to find the most feasible configuration. Tighe et al. [108] used
superpixels to retrieve similar superpixels in the training set to generate
the unary term. To handle multiple superpixel hypotheses, Ladicky
et al. [109] proposed the robust higher order potential, which enforces
labeling consistency between the superpixels and their underlying
pixels.

• Context: The context (such as boat in the water, car on the road) has
emerged as another important factor beyond the basic smoothness assumption
of the CRF model. A basic context model is implicitly captured
by the unary potential, e.g. pixels with green colors are more likely
to be of the grass class. Recently, more sophisticated class co-occurrence
information has been incorporated in the model. Rabinovich et al. ESI
learned label co-occurrence statistics in the training set and then incorporated
them into the CRF as an additional potential. Later, systems using
multiple forms of context based on co-occurrence, spatial adjacency and
appearance have been proposed in [111, 112, 108]. Ladicky et al. [31]
proposed an efficient method to incorporate global context, which penalizes
unlikely pairs of labels to be assigned anywhere in the image by
introducing one additional variable in the GraphCut model.

• Top-down and bottom-up combination: The combination of top-down
and bottom-up information has also been considered in recent
scene parsing works. The bottom-up information is better at capturing
stuff classes, which are homogeneous. On the other hand, object detectors
are good at capturing thing classes. Therefore, their combination
helps develop a more holistic scene understanding system. There are
some recent studies incorporating sliding window detectors such as the
Deformable Part Model m or Poselet [ST]. Specifically, Ladicky et
al. mu proposed the higher order robust potential based on detectors,
which uses GrabCut to generate the shape mask to compete with bottom-up
cues. Floros et al. [115] instead inferred the shape mask from the
Implicit Shape Model top-down segmentation system mg. Arbelaez
et al. m used the Poselet detector to segment articulated objects.
Guo and Hoiem [118] chose to use Auto-Context [119] to incorporate
the detector response in their system. More recently, Tighe et al. [29]
proposed to transfer the mask of training images to test images as the
shape potential by using a trained exemplar SVM model and achieved
state-of-the-art scene parsing results.

• Inference: To optimize the energy function, various techniques can be
applied, such as GraphCut, Belief-Propagation or Primal-Dual methods,
etc. A complete review of recent inference methods can be found
in [120]. Original CRF or MRF models are usually limited to 4-neighbor
or 8-neighbor connectivity. Recently, the fully connected graphical model which
connects all pixels has also become popular due to the availability of an
efficient approximation of the time-costly message-passing step via fast
image filtering ra, with the requirement that the pairwise term should
be a mixture of Gaussian kernels. Vineet et al. nza introduced the
higher order term to the fully connected CRF framework, which generalizes
its application.

• Data Driven: The current pipeline needs a pre-trained classifier, which
is quite restrictive when new classes are included in the database. Recently,
some researchers have advocated non-parametric, data-driven
approaches for open-universe datasets. Such approaches avoid training
by retrieving similar training images from the database for segmenting
the new image. Liu et al. [28] proposed to use SIFT-flow [122] to
transfer masks from training images. On the other hand, Tighe et al. PS
proposed to retrieve the nearest superpixel neighbors in training images and
achieved comparable performance to [28].

7. Image Cosegmentation 

Cosegmentation aims at extracting common objects from a set of images
(see Fig. 5 for examples). It is essentially a multiple-image segmentation
problem, where the very weak prior that the images contain the same objects
is used for automatic object segmentation. Since it does not need any pixel
level or image level object information, it is suitable for large scale image
dataset segmentation and many practical applications, and has attracted much
attention recently. Meanwhile, cosegmentation is challenging and
faces several major issues:

• The classical segmentation models are designed for a single image while
cosegmentation deals with multiple images. How to design the cosegmentation
model and minimize the model is critical for cosegmentation
research.

• The common object extraction depends on the foreground similarity
measurement. But the foreground usually varies, which makes the
foreground similarity measurement difficult. Moreover, the model minimization
is highly related to the selection of similarity measurement,
which may result in extremely difficult optimization. Thus, how to
measure the foreground similarities is another important issue.

• There are many practical applications with different requirements such
as large-scale image cosegmentation, video cosegmentation and web image
cosegmentation. Individual special needs require a cosegmentation
model to be extendable to different scenarios, which is challenging.

Figure 5: (a) An example of simultaneously segmenting one common foreground object
from a set of images [38]. (b) An example of multi-class cosegmentation [33].

Cosegmentation was first introduced by Rother et al. [123] in 2006. After
that, many cosegmentation methods have been proposed [33, 311 [35, 36j
EH EEl ESI [40J. The existing methods can be roughly classified into three
categories. The first one is to extend the existing single image based segmentation
models to solve the cosegmentation problem, such as MRF cosegmentation,
heat diffusion cosegmentation, Random Walk based cosegmentation
and active contour based cosegmentation. The second one is to design new
cosegmentation models, such as formulating cosegmentation as a clustering
problem, a graph theory based proposal selection problem, or a metric rank based
representation. The last one is to address newly emerging cosegmentation needs,
such as multiple class/foreground object cosegmentation, large scale image
cosegmentation and web image cosegmentation.


7.1. Cosegmentation by extending single-image segmentation models 

A straightforward way for cosegmentation is to extend classical single
image based segmentation models. In general, the extended models can be
represented as

$$E = E_s + E_g \qquad (29)$$

where $E_s$ is the single image segmentation term, which guarantees the smoothness
and the distinction between foreground and background in each image,
and $E_g$ is the cosegmentation term, which focuses on evaluating the consistency
between the foregrounds among the images. Only the segmentation
of common objects can result in small values of both $E_s$ and $E_g$. Thus,
cosegmentation is formulated as minimizing the energy in (29).

Many classical segmentation models have been used to form $E_s$, e.g. using
MRF segmentation models [1231 30 j as

$$E_s^{MRF} = E_u^{MRF} + E_p^{MRF} \qquad (30)$$

where $E_u^{MRF}$ and $E_p^{MRF}$ are the conventional unary potential and the pairwise
potential. The GraphCut algorithm is widely used to minimize the energy
in (30). The cosegmentation term $E_g$ is used to evaluate the multiple foreground
consistency, which is introduced to guarantee the common object
segmentation. However, it makes the minimization of (29) difficult. Various
cosegmentation terms and their minimization methods have been carefully
designed in MRF based cosegmentation models.

In particular, Rother et al. [123] evaluated the consistency by the $\ell_1$ norm,
i.e., $E_g = \sum_z |h_1(z) - h_2(z)|$, where $h_1$ and $h_2$ are features of the two
foregrounds, and $z$ indexes the dimensions of the feature. Adding the $\ell_1$
evaluation makes the minimization quite challenging. An approximation method
called the submodular-supermodular procedure has been proposed to minimize
the model by the max-flow algorithm. Mukherjee et al. replaced $\ell_1$ with the
squared $\ell_2$ evaluation, i.e. $E_g = \sum_z (|h_1(z) - h_2(z)|)^2$. The $\ell_2$ evaluation
has several advantages, such as relaxing the minimization to an LP problem and
using the Pseudo-Boolean optimization method for minimization. But it is still
an approximate solution. In order to simplify the model minimization, Hochbaum et
al. [124] used a reward strategy rather than a penalty strategy. The rationale
is that given any foreground pixel p in the first image, the similar pixel q
in the second image will be rewarded as foreground. The global term was
formulated as $E_g = \sum_z h_1(z) \cdot h_2(z)$, which is similar to the histogram based
segment similarity measurement in [TS|, to force each foreground to be similar
to the other foregrounds and also be different from their backgrounds.
The energy generated with the MRF model is proved to be submodular and can
be efficiently solved by the GraphCut algorithm. Rubio et al. [36] evaluated the
foreground similarities by high order graph matching, which is introduced
into the MRF model to form the global term. Batra et al. [125] first proposed
interactive cosegmentation, where an automatic recommendation system
was developed to guide the user to scribble on the uncertain regions for
cosegmentation refinement. Several cues such as uncertainty based cues, scribble
based cues and image level cues are combined to form a recommendation
map, where the regions with larger values are suggested to be labeled by the
user.
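
The three global terms discussed above differ only in how two foreground feature vectors are compared. The sketch below evaluates them on hypothetical 3-bin histograms; the function names and the numbers are assumptions for illustration, and the sketch does not address how each term is minimized jointly with the MRF energy.

```python
import numpy as np

def l1_term(h1, h2):
    """Sum over z of |h1(z) - h2(z)| (penalty: smaller means more consistent)."""
    return float(np.abs(h1 - h2).sum())

def squared_l2_term(h1, h2):
    """Sum over z of (h1(z) - h2(z))^2 (penalty: smaller means more consistent)."""
    return float(((h1 - h2) ** 2).sum())

def reward_term(h1, h2):
    """Sum over z of h1(z) * h2(z) (reward: larger means more consistent)."""
    return float((h1 * h2).sum())

# Hypothetical 3-bin foreground histograms from different images.
h_a = np.array([4.0, 1.0, 0.0])
h_b = np.array([3.0, 2.0, 0.0])   # similar to h_a
h_c = np.array([0.0, 1.0, 4.0])   # dissimilar to h_a

penalties_similar = (l1_term(h_a, h_b), squared_l2_term(h_a, h_b))
penalties_dissimilar = (l1_term(h_a, h_c), squared_l2_term(h_a, h_c))
rewards = (reward_term(h_a, h_b), reward_term(h_a, h_c))
```

The penalty terms are small for the similar pair and large for the dissimilar pair, while the reward term behaves the other way round, which is why it is treated as a reward rather than a penalty.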

Apart from the MRF segmentation model, Collins et al. [126] extended the
RandomWalk model to solve cosegmentation, which results in a convex minimization
problem with box constraints and a GPU implementation. In [127],
active contour based segmentation is extended for cosegmentation, which
consists of the foreground consistencies across images and the background
consistencies within each image. Due to the linear similarity measurement,
the minimization can be resolved by the level set technique.

7.2. New cosegmentation models 

The second category tries to solve the cosegmentation problem using new
strategies rather than extending existing single segmentation models. For
example, by treating the common object extraction task as a common region
clustering problem, the cosegmentation problem can be solved by a clustering
strategy. Joulin et al. [37] treated the cosegmentation labeling as training
data, and trained a supervised classifier based on the labeling to see if the
given labels are able to result in maximal separation of the foreground and
background classes. The cosegmentation is then formulated as searching for
the labels that lead to the best classification. A spectral method (Normalized
Cuts) is adopted for the bottom-up clustering, and discriminative clustering
is used to share the bottom-up clustering among images. The two clusterings
are finally combined to form the cosegmentation model, which is solved
by convex relaxation and efficient low-rank optimization. Kim et al. H2Ej
solved cosegmentation by a clustering strategy, which divides the images into
hierarchical superpixel layers and describes the relationship of the superpixels
using a graph. An affinity matrix considering intra-image edge affinity and
inter-image edge affinity is constructed. The cosegmentation can then be
solved by a spectral clustering strategy.

By representing the region similarity relationships as edge weights of a
graph, graph theory has also been used to solve cosegmentation. In [129],
by representing each image as a set of object proposals, a random forest
regression based model is learned to select the object from backgrounds.
The relationships between the foregrounds are formulated by a fully connected
graph, and the cosegmentation is achieved by loopy belief propagation. Meng
et al. [130] constructed a directed graph structure to describe the foreground
relationship by only considering the neighbouring images. The object cosegmentation
is then formulated as a shortest path problem, which can be solved
by dynamic programming.
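
The shortest path formulation can be sketched with a small dynamic program that picks one proposal per image while only comparing neighbouring images. This is a schematic of the general idea rather than the method of [130]; the function name `select_proposals` and the cost values are hypothetical.

```python
def select_proposals(costs):
    """Pick one proposal per image so that the summed transition cost is minimal.

    costs[t][a][b] is a hypothetical dissimilarity between proposal a of image t
    and proposal b of image t + 1 (neighbouring images only).
    Returns (total cost, list of chosen proposal indices).
    """
    n_first = len(costs[0])
    best = [0.0] * n_first               # best cost of a path ending at each proposal
    back = []                            # back-pointers for each transition
    for layer in costs:
        n_next = len(layer[0])
        new_best, pointers = [], []
        for b in range(n_next):
            a = min(range(len(layer)), key=lambda k: best[k] + layer[k][b])
            new_best.append(best[a] + layer[a][b])
            pointers.append(a)
        best = new_best
        back.append(pointers)
    end = min(range(len(best)), key=best.__getitem__)
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return best[end], path[::-1]

# Three images with two proposals each.
costs = [
    [[1.0, 5.0], [4.0, 2.0]],   # image 0 -> image 1
    [[3.0, 1.0], [0.5, 6.0]],   # image 1 -> image 2
]
total, chosen = select_proposals(costs)
```

The cost of the dynamic program grows linearly with the number of images, which is the practical benefit of restricting the graph to neighbouring images.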

Some methods try to learn the prior of common objects, which is then
used to segment the common objects. Sun et al. [131] solved cosegmentation
by learning discriminative part detectors of the object. By forming the
positive and negative training parts from the given images and other images,
the discriminative parts are learned based on the fact that the part detector
of the common objects should fire more frequently in positive samples
than in negative samples. The problem is finally formulated as a latent SVM
model learning problem with group sparsity regularization. Dai et al. D321
proposed the cosketch model by extending the active basis model to solve the
cosegmentation problem. In cosketch, a deformable shape template represented
by a codebook is generated to align and extract the common object. The template
is introduced into an unsupervised learning process to iteratively sketch
the images based on the shape and segment cues and re-learn the shape and
segment templates with the model parameters. With cosketch, objects of similar
shape can be well segmented.

There are also some other strategies. Faktor et al. ra solved cosegmentation
based on the similarity of composition, where the likelihood of the
co-occurring region is high if it is non-trivial and has a good match with other
compositions. The co-occurring regions between images are first detected.
Then, consensus scoring by transferring information among the regions is
performed to obtain cosegments from co-occurring regions. Mukherjee et
al. [39] evaluated the foreground similarity by forcing low entropy of the
matrix composed of the foreground features, which can handle scale
variation very well.

Figure 6: (a) Input images. (b) Cosegmentation results. Given a user’s photo stream
about a certain event, which consists of finite objects, e.g. red-girl, blue-girl,
blue-baby and apple basket, each image contains an unknown subset of them, which we
call the “Multiple Foreground Cosegmentation” problem. Kim and Xing [33] proposed the
first method to extract irregularly occurring objects from the photo stream (b) by
using a few bounding boxes provided by the user in (a).

7.3. New cosegmentation problems 

Many applications require cosegmentation on a large-scale set of images,
which is extremely time consuming. Kim et al. [Mj solved large-scale
cosegmentation by temperature maximization on anisotropic heat diffusion
(called CoSand), which starts with finite sources and performs heat
diffusion among images based on superpixels. The model is submodular
with a fast solver. Wang et al. [134] proposed a semi-supervised learning based
method for large scale image cosegmentation. With very limited training
data, the cosegmentation is formed based on an inter-image distance
term to measure the foreground consistencies among images, an intra-image
distance term to evaluate the segmentation smoothness within each image and a
balance term to avoid assigning the same label to all superpixels. The model is
converted and approximated to a convex QP problem and can be solved in
polynomial time using an active set method. Zhu et al. [135] proposed the first method
which uses a search engine to retrieve similar images to the input image to
analyze the object-of-interest information, and then uses the information to
cut out the object-of-interest. Rubinstein et al. [136] observed that there are
always noise images (which do not contain the common objects) in web
image datasets, and proposed a cosegmentation model to avoid the noise images.
The main idea is to match the foregrounds using SIFT flow to decide
which images do not contain the common objects.

Improving image classification is an important application of
cosegmentation. Chai et al. [137] proposed a bi-level cosegmentation method
and used it for image classification. It consists of a bottom level, which
uses the GrabCut algorithm to initialize the foregrounds, and a top level
with a discriminative classification to propagate the information. Later, a
TriCos model [138] containing three levels was further proposed, including
image level (GrabCut), dataset level and category level, which outperforms
the bi-level model.

The sea of personal albums on the internet, which depict events over short
periods, is a popular source of large image data that needs further
analysis. A typical scenario of a personal album is that it contains multiple
objects of different categories in the image set and each image can contain 
a subset of them. This is different from the classical cosegmentation models 
introduced above, which usually assume each image contains the same com¬ 
mon foreground while having strong variations in backgrounds. The problem 
of extracting multiple objects from personal albums is called “Multiple
Foreground Cosegmentation” (MFC) (see Fig. 6). Kim and Xing [33] proposed
the first method to handle the MFC problem. Their method starts by building
appearance models (GMM & SPM) from user-provided bounding boxes which
enclose the object-of-interest, and over-segments the images into superpix¬ 
els by p2|. Then they used beam search to find proposal candidates for 
each foreground. Finally, the candidates are seamed into non-overlapping
regions by using dynamic programming. Ma et al. [139] formulated the multiple
foreground cosegmentation as semi-supervised learning (graph transduction 
learning), and introduced connectivity constraints to enforce the extraction 
of connected regions. A cutting-plane algorithm was designed to efficiently 
minimize the model in polynomial time. 

Kim and Ma’s methods impose only an implicit constraint on objects using
low-level cues, and therefore they might assign labels of “stuff” (grass,
sky) to “things” (people or other objects). Zhu et al. [140] proposed a
principled CRF framework which explicitly expresses the object constraints from
object detectors and solves an even more challenging problem: multiple fore¬ 
ground recognition and cosegmentation (MFRC). They proposed an extended 
multiple color-line based object detector which can be on-line trained by using 
user-provided bounding boxes to detect objects in unlabeled images. Finally, 
all the cues from bottom-up pixels, middle-level contours and high-level ob¬ 
ject detectors are integrated in a robust high-order CRF model, which can 
enforce the label consistency among pixels, superpixels and object detections 
simultaneously, produce higher accuracy in object regions and achieve state- 
of-the-art performance for the MFRC problem. Later, Zhu et al. [141] further
studied another challenging problem, multiple human identification and
cosegmentation, and proposed a novel shape cue which uses geodesic filters
[83] and joint-bilateral filters to transform the blurry response maps from
multiple color-line object detectors and poselet models into an edge-aligned
shape prior. It leads to promising human identification and cosegmentation
performance.

8. Dataset and Evaluation Metrics 

8.1. Datasets 

To inspire new methods and objectively evaluate their performance for
certain applications, different datasets and evaluation metrics have been
proposed. Initially, the huge labeling cost limited the size of the datasets
[142, 20] (typically to hundreds of images). Recently, with the popularity of
crowdsourcing platforms such as Amazon Mechanical Turk (AMT) and LabelMe
[143], the labeling cost is shared among internet users, which makes large
datasets with millions of images and labels possible. Below we summarize the
most influential datasets which are widely used in the existing segmentation
literature, ranging from bottom-up image segmentation to holistic scene
understanding:

8.1.1. Single image segmentation datasets 

Berkeley segmentation benchmark dataset (BSDS) [142] is one of the
earliest and largest datasets for contour detection and single image
object-agnostic segmentation with human annotation. The latest BSDS dataset
contains 200 images for training, 100 images for validation and the
remaining 200 images for testing. Each image is annotated by at least 3
subjects. Though the size of the dataset is small, it remains one of the most
difficult segmentation datasets as it contains various object classes with
great pose variation, background clutter and other challenges. It has also
been used to evaluate superpixel segmentation methods. Recently, Li et al.
[144] proposed a new benchmark based on BSDS, which can evaluate semantic
segmentation at object or part level.

MSRC-interactive segmentation dataset [20] includes 50 images 
with a single binary ground-truth for evaluating interactive segmentation 
accuracy. This dataset also provides imitated inputs such as labeling-lasso 
and rectangle with labels for background, foreground and unknown areas.

8.1.2. Cosegmentation datasets

MSRC-cosegmentation dataset [123] has been used to evaluate image- 
pair binary cosegmentation. The dataset contains 25 image pairs with sim¬ 
ilar foreground objects but heterogeneous backgrounds, which matches the 
assumptions of early cosegmentation methods [123, 11241140] . Some pairs of 
the images are picked such that they contain some camouflage to balance
database bias. The dataset serves as the baseline cosegmentation dataset.

iCoseg dataset [125] is a large binary-class image cosegmentation dataset 
for more realistic scenarios. It contains 38 groups with a total of 643 images. 
The content of the images ranges from wild animals, popular landmarks, 
sports teams to other groups containing similar foregrounds. Each group 
contains images of similar object instances from different poses with some 
variations in the background. iCoseg is challenging because the objects are 
deformed considerably in terms of viewpoint and illumination, and in some 
cases, only a part of the object is visible. This contrasts significantly with 
the restrictive scenario of MSRC-Cosegmentation dataset. 

FlickrMFC dataset [33] is the only dataset for multiple foreground
cosegmentation, which consists of 14 groups of images with manually labeled
ground-truth. Each group includes 10~20 images which are sampled from a
Flickr photostream. The image content covers daily scenarios such as children
playing, fishing, sports, etc. This dataset is perhaps the most challenging
cosegmentation dataset as it contains a number of repeating subjects that
are not necessarily present in every image, and there are strong occlusions,
lighting variations, and scale or pose changes. Meanwhile, serious
background clutter and variations often cause even state-of-the-art object
detectors to fail in these realistic scenarios.

8.1.3. Video segmentation/cosegmentation datasets 

SegTrack dataset [145] is a large binary-class video segmentation dataset
with pixel-level annotation for primary objects. It contains six videos (bird, 
bird-fall, girl, monkey-dog, parachute and penguin). The dataset contains 
challenging cases including foreground/background occlusion, large shape de¬ 
formation and camera motion. 

CVC binary-class video cosegmentation dataset [146] contains 4
synthetic videos which paste the same foreground onto different backgrounds
and 2 videos sampled from the SegTrack. It forms a restrictive dataset for 
early video cosegmentation methods. 

MPI multi-class video cosegmentation dataset [147] was proposed to
evaluate video cosegmentation approaches in more challenging scenarios,
which contain multi-class objects. This challenging dataset contains 4
different video sets sampled from YouTube, including 11 videos with around
520 frames with ground truth. Each video set has a different number of
object classes appearing in it. Moreover, the dataset includes challenging
lighting, motion blur and image condition variations.

Cambridge-driving video dataset (CamVid) [148] is a collection of
videos with labels of 32 semantic classes (e.g. building, tree, sidewalk,
traffic light, etc.), which are captured by a fixed-position CCTV camera on
a moving automobile, with over 10 minutes of footage at 30 Hz. This dataset
contains 4 video sequences, with more than 700 images at a resolution of
960 x 720. Three of them are captured in daylight and the remaining one in
the dark. The number and the heterogeneity of the object classes in each
video sequence are diverse.

8.1.4. Static scene parsing datasets

MSRC 23-class dataset [23] consists of 23 classes and 591 images. 
Due to the rarity of ‘horse’ and ‘mountain’ classes, these two classes are 
often ignored for training and evaluation. The remaining 21 classes contain 
diverse objects. The annotated ground-truth is quite rough. 

PASCAL VOC dataset [149] provides a large-scale dataset for evalu¬ 
ating object detection and semantic segmentation. Starting from the initial 
4-class objects in 2005, now PASCAL dataset includes 20 classes of objects 
under four major categories (animal, person, vehicle and indoor). The latest 
train/val dataset has 11,530 images and 6,929 segmentations. 

LabelMe + SUN dataset: LabelMe [143] was initiated by MIT CSAIL and
provides a dataset of annotated images. The dataset is still growing. It
contains copyright-free images and is open to public contribution. As of
October 31, 2010, LabelMe has 187,240 images, 62,197 annotated images, and
658,992 labeled objects. SUN [150] is a dataset subsampled from LabelMe. It
contains 45,676 images (21,182 indoor and 24,494 outdoor) with a total of
515 object categories. One noteworthy point is that the number of objects in
each class is uneven, which can cause unsatisfactory segmentation accuracy
for rare classes.

SIFT-flow dataset: The popularity of nonparametric scene parsing requires
a large labeled dataset. The SIFT Flow dataset [122] is composed of 2,688
images that have been thoroughly labelled by LabelMe users. Liu et al. [122]
split this dataset into 2,488 training images and 200 test images and used
synonym correction to obtain 33 semantic labels.

Stanford background dataset [27] consists of around 720 images sam¬ 
pled from the existing datasets such as LabelMe, MSRC and PASCAL VOC, 
whose content consists of rural, urban and harbor scenes. Each image pixel 
is given two labels: one for its semantic class (sky, tree, road, grass, water, 
building, mountain and foreground) and the one for geometric property (sky, 
vertical, and horizontal). 

NYU dataset [151]: The NYU-depth V2 dataset is comprised of video
sequences from a variety of indoor scenes recorded by both the RGB and
depth cameras of Microsoft Kinect. It features 1449 densely labeled pairs
of aligned RGB and depth images, 464 new scenes taken from 3 cities, and
407,024 new unlabeled frames. Each object is labeled with a class and an
instance number (cup1, cup2, cup3, etc.).

Microsoft COCO [152] is a recent dataset for holistic scene understanding,
which provides 328K images with 91 object classes. One substantial
difference from other large datasets, such as the PASCAL VOC and SUN
datasets, is that Microsoft COCO contains many more labelled instances,
numbering in the millions. The authors argue that it can facilitate training
object detectors with better localization, and learning contextual
information.

8.2. Evaluation metrics 

As segmentation is an ill-defined problem, how to evaluate an algorithm’s
goodness remains an open question. In the past, evaluation was mainly
conducted through subjective human inspection or by evaluating the
performance of a subsequent vision system which uses image segmentation. To
objectively evaluate a method, it is desirable to associate the segmentation
with perceptual grouping. The current trend is to develop a benchmark [142]
which consists of human-labeled segmentations, and then to compare the
algorithm’s output with the human-labeled results using some metrics to
measure the segmentation quality. Various evaluation metrics have been
proposed:

• Boundary Matching: This method works by matching the
algorithm-generated boundaries with human-labeled boundaries, and then
computing some metric to evaluate the matching quality. The precision and
recall framework proposed by Martin et al. [153] is among the most widely
used evaluation metrics, where Precision measures the proportion of
machine-generated boundaries that can be found in the human-labeled
boundaries and is sensitive to over-segmentation, while Recall measures the
proportion of human-labeled boundaries that can be found in the
machine-generated boundaries and is sensitive to under-segmentation. In
general, this method is sensitive to the granularity of human labeling.


• Region Covering: This method [153] operates by checking the overlap
between the machine-generated regions and human-labelled regions. Let
$S_{seg}$ and $S_{hum}$ denote the machine segmentation and the human
segmentation, respectively. For a pixel $p_i$ from the pixel set
$P = \{p_1, p_2, \ldots, p_n\}$, denote the segment regions containing it as
$C(S_{seg}, p_i)$ and $C(S_{hum}, p_i)$. The relative region covering error
at $p_i$ is

$$O(S_{seg}, S_{hum}, p_i) = \frac{|C(S_{seg}, p_i) \setminus C(S_{hum}, p_i)|}{|C(S_{seg}, p_i)|} \tag{31}$$

where $\setminus$ is the set differencing operator.

The global region covering error (GCE) is defined as:

$$GCE(S_{seg}, S_{hum}) = \frac{1}{n} \min\left\{ \sum_{i} O(S_{seg}, S_{hum}, p_i), \; \sum_{i} O(S_{hum}, S_{seg}, p_i) \right\} \tag{32}$$

However, when each pixel is a segment or the whole image is a segment, the
GCE becomes zero, which is undesirable. To alleviate this problem, the
authors proposed to replace the min operation with a max operation, but such
a change will not encourage segmentation at finer detail.

Another commonly used region based criterion is the Intersection-over-Union
(IoU), which checks the overlap between $S_{seg}$ and $S_{hum}$:

$$IoU(S_{seg}, S_{hum}) = \frac{|S_{seg} \cap S_{hum}|}{|S_{seg} \cup S_{hum}|} \tag{33}$$

• Variation of Information (VI): This metric [154] measures the distance
between two segmentations $S_{seg}$ and $S_{hum}$ using average conditional
entropy. Assume that $S_{seg}$ and $S_{hum}$ have clusters
$C_{seg} = \{C_{seg}^1, C_{seg}^2, \ldots, C_{seg}^K\}$ and
$C_{hum} = \{C_{hum}^1, C_{hum}^2, \ldots, C_{hum}^{K'}\}$, respectively.
The variation of information is defined as:

$$VI(S_{seg}, S_{hum}) = H(S_{seg}) + H(S_{hum}) - 2 I(S_{seg}, S_{hum}) \tag{34}$$

where $H(S_{seg})$ is the entropy associated with clustering $S_{seg}$:

$$H(S_{seg}) = -\sum_{k=1}^{K} P(k) \log P(k) \tag{35}$$

with $P(k) = n_k / n$, where $n_k$ is the number of elements in cluster $k$
and $n$ is the total number of elements in $S_{seg}$. $I(S_{seg}, S_{hum})$
is the mutual information between $S_{seg}$ and $S_{hum}$:

$$I(S_{seg}, S_{hum}) = \sum_{k=1}^{K} \sum_{k'=1}^{K'} P(k, k') \log \frac{P(k, k')}{P(k) P(k')} \tag{36}$$

with $P(k, k') = |C_{seg}^k \cap C_{hum}^{k'}| / n$.

Although VI possesses some interesting properties, its perceptual meaning
and its potential for evaluation against more than one ground-truth are
unknown [38].

• Probabilistic Rand Index (PRI): PRI was introduced to measure the
compatibility of assignments between pairs of elements in $S_{seg}$ and
$S_{hum}$. It has been defined to deal with multiple ground-truth
segmentations [155]:

$$PRI(S_{seg}, \{S_{hum}^k\}) = \frac{1}{\binom{n}{2}} \sum_{i<j} \left[ c_{ij} p_{ij} + (1 - c_{ij})(1 - p_{ij}) \right] \tag{37}$$

where $S_{hum}^k$ is the $k$-th human-labeled ground-truth, $\binom{n}{2}$
is the number of pixel pairs, $c_{ij}$ is the event that pixels $i$ and $j$
have the same label and $p_{ij}$ is the corresponding probability. As
reported in [50], the PRI has a small dynamic range and the values across
images and algorithms are often similar, which makes the differentiation
difficult.
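
The region and clustering based metrics above can be sketched in a few lines of code. The following is a minimal illustration, not a benchmark implementation: it assumes each segmentation is given as a flat list of integer labels, one per pixel, and it estimates the PRI probability for a pixel pair as the fraction of ground truths in which the pair shares a label. All function names are hypothetical.

```python
from collections import Counter
from math import log


def contingency(seg, hum):
    """Pixel counts per (seg label, hum label) pair and per label."""
    assert len(seg) == len(hum)
    return Counter(zip(seg, hum)), Counter(seg), Counter(hum)


def gce(seg, hum):
    """Global region covering error with the min operation."""
    joint, n_seg, n_hum = contingency(seg, hum)
    n = len(seg)
    # For a pixel in seg region a and hum region b, |C_a \ C_b| = |C_a| - |C_a & C_b|.
    # joint[(a, b)] pixels share that same error value, so we sum per pair.
    err_seg_hum = sum(c * (n_seg[a] - c) / n_seg[a] for (a, b), c in joint.items())
    err_hum_seg = sum(c * (n_hum[b] - c) / n_hum[b] for (a, b), c in joint.items())
    return min(err_seg_hum, err_hum_seg) / n


def iou(seg_mask, hum_mask):
    """Intersection-over-Union of two binary masks (flat 0/1 lists)."""
    inter = sum(1 for s, h in zip(seg_mask, hum_mask) if s and h)
    union = sum(1 for s, h in zip(seg_mask, hum_mask) if s or h)
    return inter / union if union else 1.0  # assumption: two empty masks agree


def entropy(counts, n):
    """Entropy of a clustering given its cluster sizes."""
    return -sum(c / n * log(c / n) for c in counts.values())


def variation_of_information(seg, hum):
    """VI = H(seg) + H(hum) - 2 I(seg, hum)."""
    joint, n_seg, n_hum = contingency(seg, hum)
    n = len(seg)
    mutual = sum(c / n * log(c * n / (n_seg[a] * n_hum[b]))
                 for (a, b), c in joint.items())
    return entropy(n_seg, n) + entropy(n_hum, n) - 2 * mutual


def pri(seg, hums):
    """Probabilistic Rand Index against a list of ground truths.

    Quadratic in the number of pixels, so only suitable for small inputs.
    """
    n = len(seg)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            c_ij = 1.0 if seg[i] == seg[j] else 0.0
            # Empirical estimate of p_ij over the available ground truths.
            p_ij = sum(h[i] == h[j] for h in hums) / len(hums)
            total += c_ij * p_ij + (1 - c_ij) * (1 - p_ij)
    return total / (n * (n - 1) / 2)
```

For instance, with `hum = [0, 0, 1, 1]` and the degenerate segmentation `[0, 0, 0, 0]` (the whole image as one segment), `gce` returns zero, which reproduces the undesirable behaviour noted above, while `variation_of_information` and `pri` both penalize it.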

9. Discussions and Future Directions 

9.1. Design flavors 

When designing a new segmentation algorithm, it is often difficult to
make choices among various design flavors, e.g. to use superpixel or not, to
use more images or not. It all depends on the applications. Our literature
review has shed some light on the pros and cons of some commonly used
design flavors, which are worth considering carefully before committing to a
specific setup.

Patch vs. region vs. object proposal: Careful readers might notice 
that there has been a significant trend in migrating from patch based analysis 
to region based (or superpixel based) analysis. The continuous performance 
improvement in terms of boundary recall and execution time makes super¬ 
pixel a fast preprocessing technique. The advantages of using superpixel lie 
in not only the time reduction in training and inference but also more com¬ 
plex and discriminative features that can be exploited. On the other hand, 
superpixel itself is not perfect, which could introduce new structure errors. 
For users who care more about visual results on the segment boundary, pixel- 
based or hybrid approach of combining pixel and superpixel should be con¬ 
sidered as better options. The structure errors of using superpixel can also 
be alleviated by using different methods and parameters to produce multiple 
over-segmentation or using fast edge-aware filtering to refine the boundary. 
For users who care more about localization accuracy, the region based
approach is preferred due to the various advantages it introduces, while the
boundary loss can be neglected. Another rising trend that is worth
mentioning is the application of object region proposals [129, 130]. Because
object-like regions provide larger support than over-segmentation or pixels,
more complex classifiers can be used and region-level features can be
extracted. However, the recall rate of object proposals is still not
satisfactory (around 60%); therefore more careful designs need to be made
when accuracy is a major concern.
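
The boundary recall used above to judge superpixels can be measured in several ways. The following is a minimal sketch under stated assumptions: label maps are nested Python lists, a boundary pixel is one whose right or lower neighbour carries a different label, and a boundary pixel counts as matched if any boundary pixel of the other map lies within a square window of radius `tol`. Established benchmarks use a stricter one-to-one boundary matching, so the function names and the `tol` parameter here are illustrative only.

```python
def boundary_pixels(labels):
    """Set of (row, col) pixels whose right or lower neighbour has a different label."""
    rows, cols = len(labels), len(labels[0])
    points = set()
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                points.add((r, c))
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                points.add((r, c))
    return points


def matched_fraction(source, target, tol=1):
    """Fraction of source boundary pixels with a target boundary pixel within tol."""
    if not source:
        return 1.0  # assumption: nothing to match counts as a perfect score
    hits = 0
    for r, c in source:
        if any((r + dr, c + dc) in target
               for dr in range(-tol, tol + 1)
               for dc in range(-tol, tol + 1)):
            hits += 1
    return hits / len(source)


def boundary_recall(gt_labels, sp_labels, tol=1):
    """Share of ground-truth boundary pixels recovered by the superpixels."""
    return matched_fraction(boundary_pixels(gt_labels), boundary_pixels(sp_labels), tol)


def boundary_precision(gt_labels, sp_labels, tol=1):
    """Share of superpixel boundary pixels that lie on a ground-truth boundary."""
    return matched_fraction(boundary_pixels(sp_labels), boundary_pixels(gt_labels), tol)


# A 4x4 image split into left/right halves, over-segmented into four quadrants.
gt = [[0, 0, 1, 1] for _ in range(4)]
sp = [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
recall = boundary_recall(gt, sp, tol=0)
precision = boundary_precision(gt, sp, tol=0)
```

In this toy case the over-segmentation recovers every ground-truth boundary pixel, so recall is perfect, while the extra horizontal boundary lowers precision. This is why superpixel methods are usually compared on boundary recall at a fixed number of superpixels rather than on precision.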

Intrinsic cues vs. extrinsic cues: Although intrinsic cues (the features
and prior knowledge for a single image) still play dominant roles in existing
CV applications, extrinsic cues which come from multiple images (such as
multiview images, video sequences, and very large image datasets of similar
products) are attracting more and more attention. An intuitive explanation
of why extrinsic cues convey more semantics can be given in terms of
statistics. When there are many signals available, the signals which appear
repeatedly form patterns of interest, while the errors are averaged out.
Therefore, if there are multiple images containing redundant but diverse
information, incorporating extrinsic cues should bring some improvements.
When using extrinsic cues, the source of information needs to be considered
in the algorithm design. More robust constraints, such as cues from
multiple-view geometry or spatial-temporal relationships, should be
exploited first. When working with external information such as a large
dataset which contains heterogeneous data, a mechanism that can handle noisy
information should be developed.
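
The statistical intuition above can be made concrete with a small worked example. As a simplifying assumption of ours (not a model taken from the surveyed works), suppose each of $N$ images provides an observation $x_k = s + e_k$ of a common signal $s$, where the errors $e_k$ are independent with zero mean and variance $\sigma^2$. Averaging the observations gives

$$\bar{x} = \frac{1}{N} \sum_{k=1}^{N} x_k, \qquad E[\bar{x}] = s, \qquad \mathrm{Var}(\bar{x}) = \frac{\sigma^2}{N},$$

so the repeated signal is preserved while the error variance shrinks as $1/N$. The benefit weakens when the errors are correlated across images, which is one reason noisy or heterogeneous sources need explicit handling.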

Hand-crafted features vs. learned features: Hand-crafted features,
such as intensity, color, SIFT and Bag-of-Words, have played important
roles in computer vision. These simple and training-free features have been
applied to many applications, and their effectiveness has been widely
examined. However, the generalization capability of these hand-crafted
features from one task to another depends on the heuristic feature design
and combination, which can compromise the performance. The development of
low-level features has become more and more challenging. On the other hand,
features learned from labelled databases have recently demonstrated
advantages in some applications, such as scene understanding and object
detection. The effectiveness of the learned features comes from the context
information captured from longer spatial arrangements and higher-order
co-occurrence. With labelled data, some structured noise is eliminated,
which helps highlight the salient structures. However, learned features can
only detect patterns for certain tasks. Migrating to other tasks requires
newly labelled data, which is time consuming and expensive to obtain.
Therefore, it is suggested to start with hand-crafted features. If labelled
data happens to be available, then using learned features usually boosts
the performance.

9.2. Promising future directions 

Based on the discussed literature and the recent development in the
segmentation community, here we suggest some future directions which are
worth exploring:

Beyond single image: Although single image segmentation has played
a dominant role in the past decades, the recent trend is to use the internal
structures of multiple images to facilitate segmentation. When humans
perceive a scene, motion cues and depth cues provide extra information,
which should not be neglected. For example, when the image data is from a
video, the 3D geometry estimated by structure-from-motion techniques can be
used to help image understanding [156, 157]. The depth information captured
by commodity depth cameras such as Kinect also benefits indoor scene
understanding [158, 159]. Beyond 3D information, the useful information from
other 2D images in a large database, either organized (like ImageNet) or
unorganized (like a product or animal category), or from multiple
heterogeneous datasets of the same category, can also be exploited
[160, 161].

Toward multiple instances: Most segmentation methods still aim to
produce a single most likely solution, e.g. all pixels of the same category
are assigned the same label. On the other hand, people are also interested
in knowing “how many” objects of the same category there are, which existing
works cannot tell. Recently, some works have made efforts in this direction
by combining state-of-the-art detectors with CRF modelling [114, 162, 163]
and have achieved some progress. In addition, the recently developed
Microsoft COCO dataset [152], which contains many images with labelled
instances, can be expected to boost the development in this direction.

Become more holistic: Most existing works consider the image labelling
task alone [104, 23]. On the other hand, studying a related task can
improve the existing scene analysis problem. For example, by combining
object detection with image classification, Yao et al. [164] proposed a
more robust and accurate system than the ones which perform single task
analysis. Such a trend toward holistic understanding can also be seen in
other works, which combine context [110, 165], geometry [271 ESS Q2ZI H67],
attributes [168, 169, 170, 171, 172] or even language [173, 174]. It is
therefore expected that a more holistic system should lead to better
performance than those performing a single kind of analysis, though at an
increasing inference cost.

Go from shallow to deep: Features have played an important role in
many vision applications, e.g. stereo matching, detection and segmentation.
Good features, which are highly discriminative in differentiating ‘object’
from ‘stuff’, can help object segmentation significantly. However, most
features today are hand-crafted and designed for specific tasks. When
applied to other tasks, the hand-crafted features might not generalize well.
Recently, learned features using multiple-layer neural networks [175] have
been applied in many vision tasks, including object classification [176],
face detection [177], pose estimation [178], object detection [179] and
semantic segmentation [106], and achieved state-of-the-art performance.
Therefore, it is interesting to explore whether such learned features can
benefit the aforementioned more complex tasks.

10. Conclusion 

In this paper, we have conducted a thorough review of recent development
of image segmentation methods, including classic bottom-up methods,
interactive methods, object region proposals, semantic parsing and image
cosegmentation. We have discussed the pros and cons of different methods,
and suggested some design choices and future directions which are worthwhile
for exploration.

Segmentation as a community has achieved substantial progress in the
past decades. Although some technologies such as interactive segmentation
have been commercialized in recent Adobe and Microsoft products, there is
still a long way to go to make segmentation a general and reliable tool that
is widely applied in practice. This is mostly due to the existing methods’
limitations in terms of robustness and efficiency. For example, one can
always observe that given images under various degradations (strong
illumination change, noise corruption or rain), the performance of the
existing methods could drop significantly [180]. With the rapid improvement
in computing hardware, more effective and robust methods will be developed,
where the breakthrough is likely to come from collaborations with other
engineering and science communities, such as physics, neurology and
mathematics.